Case insensitivity of sheet, column, and setting names; plus more performance improvements #746

lindsay-stevens · 2025-01-09T18:58:04Z

Closes #738
Closes #731
Closes #572

Why is this the best possible solution? Were any other approaches considered?

What are the regression risks?

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Before submitting this PR, please make sure you have:

included test cases for core behavior and edge cases in tests
run python -m unittest and verified all tests pass
run ruff format pyxform tests and ruff check pyxform tests to lint code
verified that any code or assets from external sources are properly credited in comments

- validators/pyxform/choices.py: - rename function to make it clearer that data is being modified - add docstring, use constants / truthy-checks / tuple instead of list - xls2json.py: - use truthy-checks - use 'meta_children' list directly instead of creating 'survey_meta' list then unpacking into 'meta_children'. - xls2json_backends.py: - only make a new string (via replace) if nbsp is found

- move functions associated with dealias_and_group_headers to new module at pyxform/parsing/sheet_headers.py - avoid processing headers more than once - check for missing required headers - check for duplicates via aliases (e.g. name / value) - aliases.py - replace strings with "::" tokens, with tuples split "::", since that is all that the values are used for - since headers are normalised to snake_case, de-duplicate aliases - utils.py - use shared re_whitespace from xls2json in xls2json_backends.py - xls2json.py - use SurveyElement classes for expected columns, which helps skip some string processing if the header is already a match - skip processing external_choices and OSM sheet if no data provided - require "name" column on choices sheet - require "type" column on survey sheet - test_bug_missing_headers.py - move test cases over to similar scenarios in test_sheet_columns.py - test_translations.py - update test to match external-facing column names (e.g. errors/warnings should now say "image" instead of "media::image") - test_sheet_columns.py - add test cases for aliases and header processing

- xls2json_backends.py - xls and xlsx have provided sheet header info for a long time, but it was not used for xls2json header processing, and having it consistently available will allow for more reliable behaviour - all file types now include sheet names for spell checks - sheet names lowercased during this processing instead of xls2json - md and csv now provide header info - csv might need some further work but there are not many tests to support that refactoring, and csv is not the focus of this commit, and csv is not a primary (or well documented) file type. - add get_xlsform for more consistent file handling between xls2xform.convert (used by external API tests) and xls2json.parse_file_to_json (used by some internal API tests) - utils.py - update external_choices_to_csv to expect DefinitionData - move RE_WHITESPACE to xls2json_backends.py to avoid circular import - xls2json.py - update workbook_to_json to expect DefinitionData - remove parse_file_to_workbook_dict replaced by get_xlsform and associated functions - fold get_filename since this func is one line used in one place - test_xls2json.py - update SURVEY fixture md to include settings, because the markdown behaviour was updated to match xls/xlsx, which is to assume that a doc with one sheet and an unknown name is the "survey". So with "settings" there, and "survey" misspellings are warned about. - also, on line 336 the test is 'Surve' because the sheet_names passed to DefinitionData is verbatim (before lowercasing), which makes the warnings/errors more accurate.

- in these cases the type is markdown which skips parsing attempts for xlsx, xls, etc.

- use of value.strip() in is_empty and _md_strp_cell creates a new string but isspace() does the same without creating a new string. Also, pre-check empty strings before isspace(). - folded markdown row process functions into _md_table_to_ss_structure, since process_md_data did a split + iterate + join, only to pass it to _md_table_to_ss_structure which did split + iterate (twice). Now it should only be split + iterated once. There is still an extra iteration in list_to_dicts to pair row tuples with headers, which could potentially be folded in too.

- xls2json.py: - process optional sheets (all but survey) only if they have data - move entities spell check from entities_parsing.py because the entity declaration is only processed if there is data - move workbook_to_json regexes to module to compile only once - delete references to original workbook_dict to allow gc - parsing/sheet_headers.py: - use the provided sheet_header info instead of gathering from rows - if no sheet_headers provided (e.g. dict xlsform with no survey_header etc) then guess headers from first 100 rows. - add docstrings, spruce up error messages, add tests for each error - validators/pyxform/choices.py: - row may not always contain "__row" key so pop instead of del

- for truthiness checks on class instances, if there is no __bool__ defined then __len__ is used, and in this case __len__ calls get_slot_names. The latter returns a static value, but it's extra function calls vs. just True.

- the Question subclasses had almost the same code in each build_xml so the common parts are now in Question._build_xml. - avoids duplicate call to self.xml_label_and_hint in InputQuestion - avoids attempting insert_xpaths for control tags (see inline comment) - avoids modifying the control dict only to splat it into node, instead set the attributes directly as node() eventually does. - applies the same pattern to xml_instance and xml_action - removed UploadQuestion._get_media_type because this was just getting a value from the control_dict, which already would go into the node. - also collapse split string in test_builder.py

- type is checked during __init__, and it shouldn't change after that

- for choices, an Itemset class is introduced which allows calculating and storing metadata about the choice list more consistently. In particular, working out whether itext is required only once. - the initial motivation was the if block in survey.py line 391, which for a form with many choice lists resulted in a large slow down due to something approaching O(n*m) + costs from string join/split. - this allowed removing: - "_itemset*" properties from MultipleChoiceQuestion - _search_lists property from Survey - functions is_label_dynamic and has_dynamic_label - in the course of implementing this it became clear that actually the Tag class never has choices so code for that is removed. The Tag class is like an Option with respect to OsmUploadQuestion. Added some OSM-related test cases to check it's not broken. - similarly since choices/itext are handled by Survey, the MultipleChoiceQuestion class doesn't need choices children either, just a reference to the Itemset for XML and JSON output. - for instances generation, the above choices changes allowed simplification of _generate_static_instances, and otherwise the changes are mainly to avoid repeating checks or using intermediate lists by instead using generators/yield as much as possible. - test_j2x_creation.py / test_j2x_question.py / strings.ini - updated these tests since they are using internal APIs in a way that diverges significantly from what xls2json currently emits - For example test_select_one_question_multilingual had multi-lang choice labels but the expected XML string had a reference like "/test/qname/a:label" which implies choice itemsets aren't shared which has not been the case for a while. - tried to make these tests more useful by adding xpath assertions, and unlike other tests using ss_structure they may be useful for validating/showing what dict structure xlsforms can be provided as. - test_j2x_xform_build_preparation.py - removed this test since it's not clear what the expectation is. If it was intended to check that identical choice lists from separate questions are merged, then that functionality doesn't exist, and the choices should not be provided separately per question anyway. - test_dynamic_default.py / test_translations.py - updated performance test figures. - translation test benefits most from the choices changes because it has lots of choice lists. Increase in memory most likely due to Itemset instances that now wrap each choice list. - dynamic_default slightly faster due to the instances changes and earlier commits today (e.g. not calling xml_label_or_hint twice for input/text questions, etc).

lindsay-stevens added 11 commits January 2, 2025 21:07

dev: use slots on all dataclasses for reduced memory usage

b13b84f

dev: specify xlsform data type to attempt only .md parsing in tests

7ca766e

- in these cases the type is markdown which skips parsing attempts for xlsx, xls, etc.

chg: improve performance by defining survey_element.__bool__

12d5577

- for truthiness checks on class instances, if there is no __bool__ defined then __len__ is used, and in this case __len__ calls get_slot_names. The latter returns a static value, but it's extra function calls vs. just True.

chg: remove unnecessary re-validation of question type

d5ba008

- type is checked during __init__, and it shouldn't change after that

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Case insensitivity of sheet, column, and setting names; plus more performance improvements #746

Case insensitivity of sheet, column, and setting names; plus more performance improvements #746

lindsay-stevens commented Jan 9, 2025

Case insensitivity of sheet, column, and setting names; plus more performance improvements #746

Are you sure you want to change the base?

Case insensitivity of sheet, column, and setting names; plus more performance improvements #746

Conversation

lindsay-stevens commented Jan 9, 2025

Why is this the best possible solution? Were any other approaches considered?

What are the regression risks?

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Before submitting this PR, please make sure you have: